In this lesson, we will compare procedural and declarative approaches to computing aggregate values (mean, maximum value) from time series of concentrations at a single site.
In general, you will find that an R script often follows a set of common operations:
Load libraries
library(tidyverse)
library(chron)
source("functions_extra.R")
Define options
Sys.setlocale("LC_TIME","C")
## [1] "C"
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots
The data is available from the National Air Pollution Monitoring Network (NABEL) of Switzerland.
We have downloaded hourly time series from 2013 for Lausanne from the NABEL data query, and placed this file in a folder called “data/2013/” located in the subdirectory of the working directory.
First, check your working directory:
getwd()
Define your input file relative to this path:
filename <- file.path("data", "2013", "LAU.csv")
file.exists(filename)
## [1] TRUE
Read data table:
data <- read.table(filename,sep=";",skip=6,
col.names=c("datetime","O3","NO2","CO","PM10","TEMP","PREC","RAD"))
Check a sample of your data:
head(data)
## datetime O3 NO2 CO PM10 TEMP PREC RAD
## 1 31.12.2012 01:00 7.8 56.3 0.5 16.1 3.8 0 -2.4
## 2 31.12.2012 02:00 22.4 38.0 0.4 11.6 4.1 0 -2.3
## 3 31.12.2012 03:00 14.5 37.2 0.3 10.3 3.1 0 -2.1
## 4 31.12.2012 04:00 28.7 25.4 0.3 10.5 3.5 0 -2.2
## 5 31.12.2012 05:00 19.6 33.7 0.3 9.0 2.9 0 -2.2
## 6 31.12.2012 06:00 30.8 51.2 0.3 8.7 3.2 0 -2.3
Check the structure of your object:
str(data)
## 'data.frame': 8784 obs. of 8 variables:
## $ datetime: chr "31.12.2012 01:00" "31.12.2012 02:00" "31.12.2012 03:00" "31.12.2012 04:00" ...
## $ O3 : num 7.8 22.4 14.5 28.7 19.6 30.8 25.1 28.3 19.4 23.7 ...
## $ NO2 : num 56.3 38 37.2 25.4 33.7 51.2 51.7 44.2 60.6 53.8 ...
## $ CO : num 0.5 0.4 0.3 0.3 0.3 0.3 0.4 0.3 0.4 0.4 ...
## $ PM10 : num 16.1 11.6 10.3 10.5 9 8.7 8.6 9.7 10.2 11.2 ...
## $ TEMP : num 3.8 4.1 3.1 3.5 2.9 3.2 3.3 2.9 2.8 3.5 ...
## $ PREC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ RAD : num -2.4 -2.3 -2.1 -2.2 -2.2 ...
Check column classes:
ColClasses(data)
## datetime O3 NO2 CO PM10 TEMP PREC RAD
## 1 character numeric numeric numeric numeric numeric numeric numeric
Convert date/time to useful data types - the hourly timestamps in the file denote the end of the measurement periods so we will convert them to start times by subtracting one hour (out of 24):
data[,"datetime"] <- as.chron(data[,"datetime"], "%d.%m.%Y %H:%M") - 1/24
data[,"month"] <- months(data[,"datetime"])
data[,"date"] <- dates(data[,"datetime"])
Check data sample, structure, and column classes:
head(data)
## datetime O3 NO2 CO PM10 TEMP PREC RAD month date
## 1 (12/31/2012 00:00:00) 7.8 56.3 0.5 16.1 3.8 0 -2.4 Dec 12/31/2012
## 2 (12/31/2012 01:00:00) 22.4 38.0 0.4 11.6 4.1 0 -2.3 Dec 12/31/2012
## 3 (12/31/2012 02:00:00) 14.5 37.2 0.3 10.3 3.1 0 -2.1 Dec 12/31/2012
## 4 (12/31/2012 03:00:00) 28.7 25.4 0.3 10.5 3.5 0 -2.2 Dec 12/31/2012
## 5 (12/31/2012 04:00:00) 19.6 33.7 0.3 9.0 2.9 0 -2.2 Dec 12/31/2012
## 6 (12/31/2012 05:00:00) 30.8 51.2 0.3 8.7 3.2 0 -2.3 Dec 12/31/2012
str(data)
## 'data.frame': 8784 obs. of 10 variables:
## $ datetime: 'chron' num (12/31/2012 00:00:00) (12/31/2012 01:00:00) (12/31/2012 02:00:00) (12/31/2012 03:00:00) (12/31/2012 04:00:00) ...
## ..- attr(*, "format")= Named chr [1:2] "m/d/y" "h:m:s"
## .. ..- attr(*, "names")= chr [1:2] "dates" "times"
## ..- attr(*, "origin")= Named num [1:3] 1 1 1970
## .. ..- attr(*, "names")= chr [1:3] "month" "day" "year"
## $ O3 : num 7.8 22.4 14.5 28.7 19.6 30.8 25.1 28.3 19.4 23.7 ...
## $ NO2 : num 56.3 38 37.2 25.4 33.7 51.2 51.7 44.2 60.6 53.8 ...
## $ CO : num 0.5 0.4 0.3 0.3 0.3 0.3 0.4 0.3 0.4 0.4 ...
## $ PM10 : num 16.1 11.6 10.3 10.5 9 8.7 8.6 9.7 10.2 11.2 ...
## $ TEMP : num 3.8 4.1 3.1 3.5 2.9 3.2 3.3 2.9 2.8 3.5 ...
## $ PREC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ RAD : num -2.4 -2.3 -2.1 -2.2 -2.2 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 12 12 12 12 12 12 12 12 12 12 ...
## $ date : 'dates' num 12/31/2012 12/31/2012 12/31/2012 12/31/2012 12/31/2012 ...
## ..- attr(*, "format")= Named chr [1:2] "m/d/y" "h:m:s"
## .. ..- attr(*, "names")= chr [1:2] "dates" "times"
## ..- attr(*, "origin")= Named num [1:3] 1 1 1970
## .. ..- attr(*, "names")= chr [1:3] "month" "day" "year"
ColClasses(data)
## datetime O3 NO2 CO PM10 TEMP PREC RAD
## 1 chron,dates,times numeric numeric numeric numeric numeric numeric numeric
## month date
## 1 ordered,factor dates,times
View raw ozone concentrations:
ggplot(data)+
geom_line(aes(datetime, O3))+
scale_x_chron()
Solve by looping. Note that we “grow” a data frame by the function rbind.
unique.months <- levels(data[,"month"])
O3.monthly <- NULL
for(.month in unique.months) {
table <- filter(data, month == .month)
tmp <- data.frame(month=.month, O3=mean(table[,"O3"], na.rm=TRUE))
O3.monthly <- rbind(O3.monthly, tmp)
}
print(O3.monthly)
## month O3
## 1 Jan 22.68531
## 2 Feb 40.58036
## 3 Mar 35.06937
## 4 Apr 50.70181
## 5 May 51.88733
## 6 Jun 59.64025
## 7 Jul 76.14097
## 8 Aug 66.58543
## 9 Sep 43.10642
## 10 Oct 24.73957
## 11 Nov 25.89110
## 12 Dec 18.61682
Convert month column from character string to “factor”:
class(O3.monthly[,"month"])
## [1] "character"
O3.monthly[,"month"] <- factor(O3.monthly[,"month"], unique.months)
class(O3.monthly[,"month"])
## [1] "factor"
Another visual representation:
ggplot(O3.monthly) +
geom_bar(aes(month, O3), stat="identity") +
scale_y_continuous(expand=expansion(mult=c(0, 0.1)))
Calculate:
unique.dates <- unique(data[,"date"])
O3.dailymax <- NULL
for(.date in unique.dates) {
table <- data %>% filter(date == .date)
tmp <- data.frame(date=.date, O3=max(table[,"O3"], na.rm=TRUE))
O3.dailymax <- rbind(O3.dailymax, tmp)
}
head(O3.dailymax)
## date O3
## 1 15705 47.9
## 2 15706 61.2
## 3 15707 70.4
## 4 15708 47.9
## 5 15709 22.9
## 6 15710 36.8
Convert date column to chron object:
class(O3.dailymax[,"date"])
## [1] "numeric"
O3.dailymax[,"date"] <- as.chron(O3.dailymax[,"date"])
class(O3.dailymax[,"date"])
## [1] "dates" "times"
Inspect:
head(O3.dailymax)
## date O3
## 1 12/31/2012 47.9
## 2 01/01/2013 61.2
## 3 01/02/2013 70.4
## 4 01/03/2013 47.9
## 5 01/04/2013 22.9
## 6 01/05/2013 36.8
tail(O3.dailymax)
## date O3
## 361 12/26/2013 57.4
## 362 12/27/2013 42.0
## 363 12/28/2013 62.6
## 364 12/29/2013 49.9
## 365 12/30/2013 39.0
## 366 12/31/2013 24.1
Plot ECDF (empirical cumulative distribution function):
ggplot(O3.dailymax) +
geom_line(aes(O3),stat="ecdf") +
labs(y = "ECDF")
With a single expression, we reproduced the loop used to create O3.dailymax:
data %>%
group_by(month) %>%
summarize(O3 = mean(O3, na.rm=TRUE))
## # A tibble: 12 x 2
## month O3
## * <ord> <dbl>
## 1 Jan 22.7
## 2 Feb 40.6
## 3 Mar 35.1
## 4 Apr 50.7
## 5 May 51.9
## 6 Jun 59.6
## 7 Jul 76.1
## 8 Aug 66.6
## 9 Sep 43.1
## 10 Oct 24.7
## 11 Nov 25.9
## 12 Dec 18.6
We can easily extend to two variables:
data %>%
group_by(month) %>%
summarize(O3 = mean(O3, na.rm=TRUE),
NO2 = mean(NO2, na.rm=TRUE))
## # A tibble: 12 x 3
## month O3 NO2
## * <ord> <dbl> <dbl>
## 1 Jan 22.7 47.5
## 2 Feb 40.6 45.1
## 3 Mar 35.1 50.3
## 4 Apr 50.7 40.5
## 5 May 51.9 38.7
## 6 Jun 59.6 38.9
## 7 Jul 76.1 38.5
## 8 Aug 66.6 34.1
## 9 Sep 43.1 42.0
## 10 Oct 24.7 40.6
## 11 Nov 25.9 36.3
## 12 Dec 18.6 57.0
We first transform the data frame:
lf <- gather(data, variable, value, -c(datetime, month, date))
Let us inspect this transformation:
head(lf)
## datetime month date variable value
## 1 (12/31/2012 00:00:00) Dec 12/31/2012 O3 7.8
## 2 (12/31/2012 01:00:00) Dec 12/31/2012 O3 22.4
## 3 (12/31/2012 02:00:00) Dec 12/31/2012 O3 14.5
## 4 (12/31/2012 03:00:00) Dec 12/31/2012 O3 28.7
## 5 (12/31/2012 04:00:00) Dec 12/31/2012 O3 19.6
## 6 (12/31/2012 05:00:00) Dec 12/31/2012 O3 30.8
tail(lf)
## datetime month date variable value
## 61483 (12/31/2013 18:00:00) Dec 12/31/2013 RAD -2.4
## 61484 (12/31/2013 19:00:00) Dec 12/31/2013 RAD -2.4
## 61485 (12/31/2013 20:00:00) Dec 12/31/2013 RAD -2.3
## 61486 (12/31/2013 21:00:00) Dec 12/31/2013 RAD -2.2
## 61487 (12/31/2013 22:00:00) Dec 12/31/2013 RAD -1.6
## 61488 (12/31/2013 23:00:00) Dec 12/31/2013 RAD -1.3
ColClasses(lf)
datetime month date variable value
1 chron,dates,times ordered,factor dates,times character numeric
This is amenable for plotting:
ggplot(lf) +
facet_grid(variable~., scale="free_y") +
geom_line(aes(datetime, value))+
scale_x_chron()
Using this, we can aggregate using three approaches:
group_by: as illustrated abovestat_summary: called through geom operationgroup_by operationresult <- lf %>%
group_by(month, variable) %>%
summarize(value = mean(value, na.rm=TRUE))
The inverse operation of gather:
spread(result, variable, value)
## # A tibble: 12 x 8
## # Groups: month [12]
## month CO NO2 O3 PM10 PREC RAD TEMP
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Jan 0.444 47.5 22.7 22.5 0.0872 46.3 2.23
## 2 Feb 0.439 45.1 40.6 28.8 0.0711 85.0 0.592
## 3 Mar 0.485 50.3 35.1 35.9 0.107 98.9 4.75
## 4 Apr 0.393 40.5 50.7 25.5 0.119 171. 10.6
## 5 May 0.334 38.7 51.9 12.3 0.187 174. 12.0
## 6 Jun 0.338 38.9 59.6 17.6 0.124 249. 17.7
## 7 Jul 0.368 38.5 76.1 19.9 0.164 273. 22.4
## 8 Aug 0.333 34.1 66.6 17.3 0.0540 242. 20.9
## 9 Sep 0.395 42.0 43.1 17.5 0.125 157. 17.0
## 10 Oct 0.413 40.6 24.7 17.5 0.287 82.6 13.7
## 11 Nov 0.369 36.3 25.9 13.6 0.157 53.9 5.99
## 12 Dec 0.525 57.0 18.6 25.7 0.132 46.7 3.87
stat_summaryThe mean cal also be calculated in the process of plotting:
ggplot(lf) +
facet_grid(variable~., scale="free_y") +
geom_bar(aes(month, value), stat="summary", fun="mean") +
scale_y_continuous(expand=expansion(mult=c(0, 0.1)))
Add errorbars to denote the full range in values:
ggplot(lf, aes(month, value)) +
facet_grid(variable~., scale="free_y") +
geom_bar(stat="summary", fun="mean") +
geom_errorbar(stat="summary",
fun.min=min, #function(x) quantile(x, .25),
fun.max=max, #function(x) quantile(x, .75))+
width=0.1)